Abstract. This paper describes a novel approach to grammar induction
that has been developed within a framework designed to integrate
learning with other aspects of computing, AI, mathematics and logic. 
This framework, called information compression by multiple alignment, 
unification and search (ICMAUS), is founded on principles of Minimum 
Length Encoding pioneered by Solomonoff and others. Most of the paper 
describes SP70, a computer model of the ICMAUS framework that in- 
corporates processes for unsupervised learning of grammars. An example 
is presented to show how the model can infer a plausible grammar from 
appropriate input. Limitations of the current model and how they may 
be overcome are briefly discussed. 

1 Introduction 

This paper describes a novel approach to unsupervised grammar induction that 
has been developed within a research programme whose overarching goal is the 
integration of diverse functions — learning, recognition, reasoning and others — 
within one relatively simple framework. This goal of integration has had a
substantial impact on the way in which the learning processes are organised.

The new framework, called information compression by multiple alignment,
unification and search (ICMAUS), originated in earlier research developing the
SNPR model of grammar induction [25, 24]. Without supervision, the SNPR
model successfully learns artificial context-free phrase-structure grammars
(CF-PSGs) using a technique of 'hierarchical chunking' combined with a search for
disjunctive (part of speech) categories and processes for generalising grammatical
rules and correcting over-generalisations.

In the ICMAUS programme, the aim has been to match or exceed these 
capabilities within a system that has been generalised to model a range of other 
aspects of computing, AI, mathematics and logic. It became apparent at an 
early stage that this would mean a radical reorganisation of the SNPR model. 
In the ICMAUS framework a concept of multiple alignment — to be described — 
has replaced hierarchical chunking as the predominant mode of organisation. 
With this new orientation, the system provides an interpretation for concepts in
computing, mathematics and logic and it has a range of AI capabilities described
in [28] and earlier papers cited there. The present paper describes how the system
has been developed for unsupervised learning of grammars. 

A much fuller account of the research described here may be found in [27],
available from http://www.cognitionresearch.org.uk/papers/ul/ul.htm



1.1 Relationship with Other Research on Grammar Induction 

This research extends the tradition of distributional linguistics pioneered by [8, 10]
and others.

At the heart of the ICMAUS system are principles of Minimum Length Encoding
(MLE) pioneered by Solomonoff and others. In this framework, grammar induction
is conceived as a process of optimisation rather than a process of identifying a
target grammar 'in the limit' as postulated by [7]. In the MLE framework, there
is no target grammar, merely a process of searching for grammars that are 'good' 
in terms of MLE principles. 

Recent studies that are, perhaps, most closely related to the present research
include: [1, 2, 3, 5, 9, lOTL, 14, 15, 16, 17, 21, 20, 19]. Not all of these studies have
adopted MLE principles but they deal with issues and processes that relate to the 
present research. The idea of combining learning with parsing — to be described — 
has also been developed by Nakamura (see J3| and this workshop). 

Compared with other work on unsupervised learning of grammar-like struc- 
tures, the most distinctive features of the ICMAUS research are: 

— The integration of learning with other areas of AI, computation, mathematics 
and logic. 

— The multiple alignment concept as it has been developed in the IC- 
MAUS framework, described below. There is, however, a clear affinity with 
'alignment-based learning' [21, 20].



2 The ICMAUS Framework 



In the ICMAUS framework, all knowledge is stored as patterns: arrays of symbols 
in one or two dimensions.¹ Despite the simplicity of this format, it is possible
within the ICMAUS system to represent several different kinds of knowledge 
including context-free and context sensitive grammars, networks, trees, if-then 
rules and others. 

Given the generality of this format for knowledge, the learning techniques 
described in this paper are relevant to the learning of any kind of knowledge, 
not just 'grammars', narrowly conceived. 

The ICMAUS framework is intended as an abstract model of any kind of system
for computing or cognition, either natural or artificial. In broad terms, the
system works by receiving 'New' information from its environment and transferring
it to a repository of 'Old' information. At the same time, it tries to compress
the information as much as possible by finding patterns that match each other
and merging or 'unifying' patterns that are the same. In these broad terms it
is similar to a ZIP program but it differs in the thoroughness of the search for
'good' unifications of patterns and in the 'multiple alignment' concept, to be
described.

¹ In work to date, the focus has been on one-dimensional patterns.



2.1 Multiple Alignment 

The concept of multiple alignment in the ICMAUS framework has been borrowed 
from the field of bio-informatics and adapted as described in |^ . 

An example of an ICMAUS multiple alignment is shown in Figure 1. Row 0
contains the New pattern 'oneofthemdoes' and all the other rows contain
Old patterns, one pattern per row. By convention, the New pattern is always
shown in row 0 but otherwise the assignment of patterns to rows is entirely
arbitrary.

Fig. 1. A multiple alignment with 'oneofthemdoes' in New and patterns
representing grammatical rules in Old. 



Apart from the pattern in row 8, the patterns from Old in this example are
like re-write rules in a CF-PSG with the re-write arrow omitted. If we ignore
row 8, the alignment shown in Figure 1 is very much like a conventional parsing,
marking the main components of the sentence: words and phrases and the
sentence pattern itself (shown in row 5).

Row 8 shows how the 'discontinuous' dependency that exists between the 
singular noun in the subject of the sentence ('Ns') and the singular verb ('Vs') 
can be marked within the alignment in a relatively direct manner. Despite the 
simplicity of the format for representing knowledge, the formation of multiple 
alignments enables the system to express 'context sensitive' aspects of language 
and other kinds of knowledge. 

In each Old pattern there are two kinds of symbols: ID-symbols like '<', 'N',
'Np', '0' and '>' in '< N Np t h e m >' serve to identify the pattern and the
remaining symbols ('t h e m' in this example) are C-symbols that represent the
contents or substance of the pattern.

Much more detail, with many more examples, may be found in [26].

3 SP70 

All the main components of the ICMAUS framework outlined in Section 2 are
now realised within the SP70 software model (version 9.2). The model is able to
abstract plausible grammars from sets of simple sentences without prior knowledge
of word segments or the classes to which they belong, and the computational
complexity of the model appears to be acceptable (Section 4). However, in its
current form, the model has at least two significant shortcomings and some other
deficiencies, discussed briefly in Section 6.2.

3.1 Objectives 

In the development of this model, the main problems that have been addressed 
are: 

— How to identify significant segments in the 'corpus' of raw data when the 
boundary between one segment and the next is not marked explicitly. 

— How to identify disjunctive classes of syntactically-equivalent segments (e.g., 
'nouns', 'verbs' and 'adjectives'). 

— How to combine the learning of segmental structure with the learning of 
disjunctive classes. 

— How to learn segments and disjunctive classes through two or more levels of 
abstraction. 

— How to generalize grammatical rules beyond the data and how to correct 
over-generalizations without feedback from a 'teacher' or the provision of 
'negative' samples or the grading of the data from 'easy' to 'hard' (cf.

Solutions to these problems were found in the SNPR model [25, 24] but, as
noted earlier, the organisation of this model is quite unsuited to the wider goals
of the present research, namely the integration of diverse functions within one
framework. The SP70 model (v. 9.2) provides solutions to the first three problems
and partial solutions to the fourth and fifth problems. Further development is
planned as indicated in Section 6.2 below.

3.2 Overall Structure of the Model 

Figure 2 shows the high-level organisation of the SP70 model.

The function create_multiple_alignments() referred to in Figure 2 creates zero
or more multiple alignments, each one comprising the current pattern from New
(CPFN) and one or more patterns from Old. This function is essentially the
same as the main component of the SP61 model, described quite fully in [26].
Readers are referred to this source for a more detailed description of how multiple
alignments are formed in the ICMAUS framework.



SP70()
{
  1 Read a set of patterns into New. Old is initially empty.
  2 Compile an alphabet of symbol types in New and, for each type,
      find its frequency of occurrence and the number of bits
      required to encode it (using the Shannon-Fano-Elias method).
  3 While (there are unprocessed patterns in New)
    {
    3.1 Identify the first or next pattern from New as the
          'current pattern from New' (CPFN).
    3.2 Apply the function CREATE_MULTIPLE_ALIGNMENTS() to
          create multiple alignments, each one between the
          CPFN and one or more patterns from Old.
    3.3 During 3.2, the CPFN is copied into Old, one symbol
          at a time, in such a way that the CPFN can be
          aligned with its copy but that any one symbol in
          the CPFN cannot be aligned with the corresponding
          symbol in the copy.
    3.4 Sort the alignments formed by this function in order
          of their compression scores and select the best
          few for further processing.
    3.5 Process the selected alignments with the function
          DERIVE_PATTERNS(). This function derives encoded
          patterns from alignments and adds them to Old.
    }
  4 Apply the function SIFTING_AND_SORTING() to create one or
      more alternative grammars for the patterns in New, each
      one scored in terms of MLE principles. Each grammar is
      a subset of the patterns in Old.
}

Fig. 2. The organisation of SP70. The workings of the functions
create_multiple_alignments(), derive_patterns() and sifting_and_sorting() are
explained in the text.
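
Operation 2 of Figure 2 can be illustrated with a short Python sketch. It
assumes that the cost of a symbol type is derived from its relative frequency p
in New, and it reports both the ideal cost log2(1/p) and the
Shannon-Fano-Elias codeword length, ceil(log2(1/p)) + 1. The function name
symbol_costs and the decision to report both figures are assumptions made for
illustration; the exact calculation in SP70 may differ.

```python
import math
from collections import Counter


def symbol_costs(patterns):
    """Return {symbol: (frequency, ideal_bits, sfe_bits)} for all symbols.

    `patterns` is a list of patterns, each a list of symbols.
    ideal_bits is log2(1/p); sfe_bits is the Shannon-Fano-Elias
    codeword length ceil(log2(1/p)) + 1. Both are illustrative.
    """
    counts = Counter(symbol for pattern in patterns for symbol in pattern)
    total = sum(counts.values())
    costs = {}
    for symbol, freq in counts.items():
        ideal = math.log2(total / freq)
        costs[symbol] = (freq, ideal, math.ceil(ideal) + 1)
    return costs


# Two toy New patterns, one symbol per letter.
new = [list("thatboyruns"), list("thatgirlruns")]
costs = symbol_costs(new)
for symbol, (freq, ideal, sfe) in sorted(costs.items()):
    print(f"{symbol}: freq={freq} ideal={ideal:.2f} bits sfe={sfe} bits")
```

Frequent symbols such as 't' receive shorter codes than rare ones such as 'b',
which is what allows frequently used patterns to be encoded cheaply.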

3.3 Deriving Patterns from Alignments 

In operation 3.5 in Figure 2, the derive_patterns() function is applied to a
selection of the best alignments formed and, in each case, it looks for sequences
of unmatched symbols within the alignment and also sequences of matched
symbols.

Consider the alignment shown in Figure 3. From an alignment like that, the
function finds the unmatched sequences 'g i r l' and 'b o y' and, within row 1, it
also finds the matched sequences 't h a t' and 'r u n s'. With respect to row 1,
the focus of interest is the matched and unmatched sequences of C-symbols;
ID-symbols are ignored.

A copy of each of the four sequences is made, ID-symbols are added to each 
copy and the copy is added to Old. In addition, another 'abstract' pattern is 
made that records the sequence of matched and unmatched patterns within the 
alignment. The result in this case is five patterns like those shown in Figure 4.



0        t h a t g i r l       r u n s   0
         | | | |               | | | |
1 < %1 9 t h a t         b o y r u n s > 1

Fig. 3. A simple alignment from which other patterns may be derived. 

< %7 12 t h a t >
< %9 14 b o y >
< %9 15 g i r l >
< %8 13 r u n s >
< %10 16 < %7 > < %9 > < %8 > >

Fig. 4. Patterns derived from the alignment shown in Figure 3.

It should be clear that the set of patterns in Figure 4 is, in effect, a simple
grammar for the two sentences in Figure 3, with patterns representing grammatical
rules in much the same style as those shown in Figure 1. The abstract
pattern '< %10 220 < %7 > < %9 > < %8 > >' describes the overall structure of
this kind of sentence with slots that may receive individual words at appropriate
points in the pattern.

Notice how the symbol '%9' serves to mark 'b o y' and 'g i r l' as alternatives
in the middle of the sentence. This is a grammatical class in the tradition of
distributional or structural linguistics (see, for example, |HE|).
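
The derivation of patterns from the alignment in Figure 3 can be sketched in
Python as follows. This is a minimal illustration and not the SP70 code: the
standard-library difflib.SequenceMatcher stands in for the alignment produced by
create_multiple_alignments(), the class names ('%0', '%1', ...) are invented
here, and the second ID-symbol that appears in the patterns of Figure 4 is
omitted.

```python
from difflib import SequenceMatcher


def derive_patterns(new, old):
    """Derive Figure 4-style patterns from two symbol sequences.

    Each matched run gets its own class. The two sides of an unmatched
    region share one class, which marks them as alternatives. A final
    'abstract' pattern records the order of the classes.
    """
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    patterns, abstract = [], ["<", "%0"]
    class_no = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        class_no += 1
        class_id = f"%{class_no}"
        # 'equal' yields one sequence; other tags yield up to two alternatives.
        if tag == "equal":
            sequences = [new[i1:i2]]
        else:
            sequences = [new[i1:i2], old[j1:j2]]
        for seq in sequences:
            if seq:
                patterns.append(["<", class_id, *seq, ">"])
        abstract += ["<", class_id, ">"]
    patterns.append(abstract + [">"])
    return patterns


derived = derive_patterns(list("thatgirlruns"), list("thatboyruns"))
for pattern in derived:
    print(" ".join(pattern))
# < %1 t h a t >
# < %2 g i r l >
# < %2 b o y >
# < %3 r u n s >
# < %0 < %1 > < %2 > < %3 > >
```

As in Figure 4, the shared class ('%2' in this sketch) is what records that the
two unmatched sequences are alternatives in the same context.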

3.4 Sifting and Sorting of Patterns 

In the example just shown, all the patterns derived from the alignment are
'correct'. But in many cases, patterns that are derived in this way and added to
Old are 'wrong'. The wrong patterns are weeded out in the sifting_and_sorting()
stage of processing (operation 4 in Figure 2), where the system develops one
or more alternative grammars for the patterns in New in accordance with MLE
principles. Figure 5 shows the overall structure of the sifting_and_sorting()
function.

Compiling a Set of Alternative Grammars. A set of alternative grammars
for the patterns in New that are good in terms of MLE principles is derived
(in the compile_alternative_grammars() function) in operation 4 of Figure 5.
Each grammar is a subset of the patterns that have been added to Old during
operation 3 of Figure 2.

The process of compiling good grammars is essentially a hill-climbing search
through the abstract space of alternative grammars, trying to minimise (G + E)
for each grammar, where G is the size of the given grammar (in bits) and E is
the size of all the New patterns (in bits) after they have been encoded in terms
of the grammar. Minimising (G + E) is, of course, the central idea in grammar
induction using MLE principles. In what follows, (G + E) is abbreviated as T.



SIFTING_AND_SORTING()
{
  1 For each pattern in Old, set its frequency of occurrence to 0.
  2 While (there are still unprocessed patterns in New)
    {
    2.1 Identify the first or next pattern from New as the CPFN.
    2.2 Apply the function CREATE_MULTIPLE_ALIGNMENTS() to
          create multiple alignments, each one between the CPFN
          and one or more patterns from Old.
    2.3 From amongst the best of the multiple alignments formed,
          select 'full' alignments in which all the symbols of
          the CPFN are matched and all the C-symbols are
          matched in each pattern from Old.
    2.4 For each pattern from Old, count the maximum number of
          times it appears in any one of the full alignments
          selected in operation 2.3. Add this count to the
          frequency of occurrence of the given pattern.
    }
  3 Compute frequencies of symbol types and their encoding costs.
      From these values, compute encoding costs of patterns in
      Old and new compression scores for each of the full
      alignments created in operation 2.
  4 Using the alignments created in 2 and the values computed in
      operation 3, COMPILE_ALTERNATIVE_GRAMMARS().
}

Fig. 5. The organisation of the sifting_and_sorting() function. The
compile_alternative_grammars() function is described in the text.
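
Operation 2.4 of Figure 5 can be sketched in Python as follows. The sketch
assumes that each full alignment is represented simply as a list of the names of
the Old patterns it contains; that representation, the pattern names and the
function name update_frequencies are all hypothetical.

```python
from collections import Counter


def update_frequencies(frequencies, full_alignments):
    """Apply operation 2.4 for one CPFN.

    `full_alignments` is a list of alignments, each a list of names of
    Old patterns (a name may repeat within one alignment). For each
    pattern, the maximum count found in any single alignment is added
    to its running frequency.
    """
    best = Counter()
    for alignment in full_alignments:
        for name, count in Counter(alignment).items():
            best[name] = max(best[name], count)
    frequencies.update(best)  # Counter.update adds to existing counts
    return frequencies


frequencies = Counter()
# Full alignments for a first CPFN (two alternative analyses).
update_frequencies(frequencies, [["S", "D_that", "N_boy", "V_runs"],
                                 ["S", "D_that", "X_boyruns"]])
# Full alignments for a second CPFN.
update_frequencies(frequencies, [["S", "D_that", "N_girl", "V_runs"]])
print(frequencies["S"], frequencies["N_boy"], frequencies["X_boyruns"])
```

Taking the maximum over alignments, and not the sum, means that a pattern is
not counted repeatedly just because one New pattern has several alternative
analyses.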

The grammars are built in stages, at first trying to minimise T for the first 
New pattern alone, then trying to minimise T for the first and second New 
pattern, followed by the first, second and third, and so on. 
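
The quantity T = G + E can be illustrated with a toy comparison in Python. The
sketch below scores two candidate grammars for the eight sentences of Section 5:
one that stores every sentence whole and one that corresponds to the grammar of
Figure 7. It assumes a flat cost of one unit per symbol and a very simple form
of encoding (a list of ID-symbols per sentence); both are simplifications
introduced here, whereas SP70 derives costs in bits from symbol frequencies.
The 'rote' grammar and its identifiers ('s0', 's1', ...) are hypothetical.

```python
from itertools import product


def size_in_bits(patterns, cost):
    """Sum the cost of every symbol in every pattern."""
    return sum(cost(symbol) for pattern in patterns for symbol in pattern)


def total_cost(grammar, encodings, cost):
    """Return (G, E, T), where T = G + E."""
    g = size_in_bits(grammar, cost)
    e = size_in_bits(encodings, cost)
    return g, e, g + e


def flat_cost(symbol):
    """Hypothetical cost model: every symbol costs one unit."""
    return 1


determiners = {"3": "that", "2": "some"}
nouns = {"5": "boy", "6": "girl"}
verbs = {"4": "runs", "7": "walks"}

# Candidate B: the grammar of Figure 7.
figure7 = [["<", cls, key, *word, ">"]
           for cls, table in (("%2", determiners), ("%1", nouns), ("%3", verbs))
           for key, word in table.items()]
figure7.append(["<", "1", "<", "%2", ">", "<", "%1", ">", "<", "%3", ">", ">"])

# Candidate A stores each sentence whole and encodes it by its own ID;
# candidate B encodes each sentence by its three word choices.
rote_grammar, rote_codes, figure7_codes = [], [], []
for n, (d, noun, v) in enumerate(product(determiners, nouns, verbs)):
    sentence = determiners[d] + nouns[noun] + verbs[v]
    rote_grammar.append(["<", f"s{n}", *sentence, ">"])
    rote_codes.append([f"s{n}"])
    figure7_codes.append([d, noun, v])

rote_score = total_cost(rote_grammar, rote_codes, flat_cost)
figure7_score = total_cost(figure7, figure7_codes, flat_cost)
print(rote_score)     # (120, 8, 128)
print(figure7_score)  # (60, 24, 84)
```

Under these assumptions the structured grammar has the larger E but a much
smaller G, so its T is lower. A hill-climbing search of the kind described
above would prefer it.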

4 Computational Complexity 

In a serial processing environment, the time complexity of SP70 is approximately
O(N^2) where N is the number of patterns in New. In a parallel processing
environment, the time complexity may approach O(N), depending on how well
the parallel processing is applied. In serial or parallel environments, the space
complexity should be O(N).

The time complexity of the program may be improved when it has been 
developed, as envisaged, so that the New patterns are processed in batches, with 
a purging of Old between each batch to remove all patterns except those in the 
best grammar. In this case, the time complexity should be O(N).

5 Example 

When New contains the eight sentences shown in Figure 6, the best grammar
found by SP70 is the one shown in Figure 7.



t h a t b o y r u n s
t h a t g i r l r u n s
t h a t b o y w a l k s
t h a t g i r l w a l k s
s o m e b o y r u n s
s o m e g i r l r u n s
s o m e b o y w a l k s
s o m e g i r l w a l k s

Fig. 6. Eight sentences supplied to SP70 as New.

< %2 2 s o m e >
< %2 3 t h a t >
< %1 5 b o y >
< %1 6 g i r l >
< %3 4 r u n s >
< %3 7 w a l k s >
< 1 < %2 > < %1 > < %3 > >

Fig. 7. The best grammar (in terms of MLE principles) that is found by SP70
when New contains the eight sentences shown in Figure 6.



5.1 Intermediate Results 

As the first phase of learning proceeds (operation 3 of Figure 2), intermediate
results are often much less tidy than the example shown in Section 3.3. For
example, when Old contains only the first pattern shown in Figure 6, the only
alignment it can create is:



Notice that the Old pattern (in row 1) is, in effect, the same pattern as the 
New pattern (in row 0) so it is not permissible to match 'o' in the New pattern, 
for example, with 'o' in the Old pattern because that would mean matching a 
given symbol with itself! 

From the alignment just shown, the program derives 'bad' patterns like '< 
%3 14 t h a >', '< %4 18 b o y r u n s >' and '< %4 17 h a t b o y r u n 
s >' and these are added to Old. However, as later patterns are processed, the 
repository of Old patterns begins to accumulate enough patterns that are good 
in MLE terms so that it is able to create quite respectable looking parsings like 
this: 

In the sifting_and_sorting() phase, all the 'bad' patterns are discarded and
the 'good' patterns are cleaned up by removing unnecessary ID-symbols and
renaming the retained ID-symbols in a tidy manner.



5.2 Values for G, E, T and Compression 

Table 1 shows changing values for G, E and T for the best grammar found
(in terms of MLE principles) as successive patterns from New are processed
in compile_alternative_grammars(). It is interesting to see that, as successive
patterns are processed, progressively more compression is achieved, represented
by the falling values for (T / 'original'), shown in the last column.



Pattern          G          E          T   Original   Compression
      1    7970.49      26.78    7997.27    7943.70          1.00
      2   11085.38     191.29   11276.67   16569.42          0.68
      3   14665.26     302.09   14967.35   25195.14          0.59
      4   14665.26     397.57   15062.83   34502.87          0.44
      5   17650.07     563.32   18213.39   42488.08          0.42
      6   17650.07     713.75   18363.82   51155.30          0.36
      7   17650.07     887.00   18537.07   59822.52          0.31
      8   17650.07    1044.92   18694.99   69171.76          0.27


Table 1. Cumulative values (in bits) of G, E and T for the best 
grammar found as successive patterns from New are processed in com- 
pile_alternative_grammars(). For comparison purposes, the cumulative sizes of 
the original patterns (excluding ID-symbols) are shown in the 'original' column 
and values for compression (T / 'original') are shown in the last column. 
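
The two relationships that underlie Table 1, T = G + E and compression =
T / 'original', can be checked with a few lines of Python. The values below are
copied from the table; the tolerance of 0.01 on the compression ratio is an
assumption made here to allow for rounding in the printed figures.

```python
# (pattern, G, E, T, original, compression) as printed in Table 1.
TABLE_1 = [
    (1, 7970.49, 26.78, 7997.27, 7943.70, 1.00),
    (2, 11085.38, 191.29, 11276.67, 16569.42, 0.68),
    (3, 14665.26, 302.09, 14967.35, 25195.14, 0.59),
    (4, 14665.26, 397.57, 15062.83, 34502.87, 0.44),
    (5, 17650.07, 563.32, 18213.39, 42488.08, 0.42),
    (6, 17650.07, 713.75, 18363.82, 51155.30, 0.36),
    (7, 17650.07, 887.00, 18537.07, 59822.52, 0.31),
    (8, 17650.07, 1044.92, 18694.99, 69171.76, 0.27),
]


def check_row(row, tolerance=0.01):
    """Return (sum_ok, ratio_ok, ratio) for one row of the table."""
    _, g, e, t, original, compression = row
    ratio = t / original
    sum_ok = abs((g + e) - t) < 0.005
    ratio_ok = abs(ratio - compression) < tolerance
    return sum_ok, ratio_ok, ratio


results = [check_row(row) for row in TABLE_1]
for row, (sum_ok, ratio_ok, ratio) in zip(TABLE_1, results):
    print(f"pattern {row[0]}: T=G+E {sum_ok}, T/original={ratio:.4f} {ratio_ok}")
```

G grows only when new patterns are added to the grammar (after patterns 1, 2, 3
and 5), while 'original' grows with every sentence, which is why the ratio
falls steadily.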



6 Discussion 
6.1 Evaluation 

In accordance with the 'looks-good-to-me' approach to the evaluation of grammar
induction systems [20], the grammar shown in Figure 7 looks like an appropriate
grammar for the patterns shown in Figure 6.² This may seem like a sloppy
method of evaluation but it should not be forgotten that the human brain is, by
a wide margin, the best learning system on the planet. This provides a justification
for using human judgement of what does or does not 'look good' as a means
of evaluating the output of artificial learning systems. With any system that is
sufficiently robust to be applied to realistic samples of natural language,
there is no alternative to (human) judgements about what is or is not a 'correct'
grammar for a given language or (human) conventions about how language is
segmented into words. Statistical tests may be applied to establish whether or
not there is a significant level of agreement between structures established by
human judgement and the results of artificial learning [22, 23].

² A possible improvement might be a grammar that isolates the 's' in 'r u n s' and
'w a l k s' as a separate morpheme.

Notice that the use of a 'target' grammar as a criterion of success (as in 
Gold's approach to learning [7]) does not overcome the problem that, for any
given language sample, there are many alternative grammars that are compatible 
with the sample and some are 'better' than others. 

6.2 Reorganisation Needed 

The example in the previous section is good enough to show that the approach 
is sound but experiments with other examples have shown that the model suffers 
from two main weaknesses: 

— Although the model in its current form can isolate basic segments and tie 
them together in an overall abstract structure, it is not good at finding 
intermediate levels of abstraction. 

— In the development of the model to date, no attempt has been made to enable 
the system to detect discontinuous dependencies such as number dependency 
between the subject of a sentence and its main verb (as mentioned in Section 
2.1). Although this kind of capability may seem like a refinement that we can
afford to do without at this stage of development, a deficiency in this area 
seems to have an impact on the program's performance at an elementary 
level. 

A possible solution to both problems is a reorganisation of the model so
that learning is integrated even more closely with parsing. Recent work has
shown that operation 2.2 in the sifting_and_sorting() function (Figure 5) can be
omitted: the multiple alignments from operation 3.2 in Figure 2 can be used
instead. It is also envisaged that New patterns will be processed in batches and
that, after each batch, sifting_and_sorting() will be applied and Old patterns
that are not proving useful will be discarded.

7 Conclusion 

SP70 is not yet an 'industrial strength' system for unsupervised learning but I 
believe the framework has considerable potential and provides a sound basis for 
further development. 



A key attraction of this approach to learning is that the ICMAUS framework 
provides a unified view of a variety of issues in AI thus facilitating the integration 
of grammar induction with other aspects of intelligence. Given the generality of 
the framework, the learning techniques described here are relevant to the learning 
of any kind of knowledge, not just grammars. 